Quality: Programmatic Assessment 2

Quiz

Using the results of the programmatic assessment in the Jupyter Notebook below, identify the results that are indicative of data quality issues in the following quizzes.

Quality: Programmatic Assessment

Which of the following parts of the programmatic assessment in the Jupyter Notebook below are indicative of data quality issues? (Hint: Make sure to look for variations of the same name.)

Value count for the surname 'Doe' is 6

'Jake Jakobsen' is a duplicated name

'Elizabeth Knudsen' is a duplicated name

Lowest weight is 48.8 lbs

Indexes in the sort_values results are out of order

No null entries are returned from sum and isnull on the auralin and novodra columns

SOLUTION:

Value count for the *surname* 'Doe' is 6
'Jake Jakobsen' is a duplicated name
Lowest weight is 48.8 lbs
No null entries are returned from `sum` and `isnull` on the *auralin* and *novodra* columns
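
The checks named in the solution can be sketched with pandas as follows. This is a minimal illustration, not the course notebook: the toy values are invented, the table and column names are assumed from the quiz text, and the `'-'` placeholder for a missing dose is a hypothetical stand-in for whatever non-null marker a dataset might use.

```python
import pandas as pd

# Hypothetical toy data; names follow the quiz text, values are invented.
patients = pd.DataFrame({
    "surname": ["Doe", "Doe", "Doe", "Jakobsen", "Knudsen"],
    "weight": [180.0, 180.0, 180.0, 48.8, 150.2],
})
treatments = pd.DataFrame({
    "auralin": ["41u - 48u", "-", "33u - 36u"],
    "novodra": ["-", "40u - 45u", "-"],
})

# How often each surname occurs; a default name such as 'Doe' stands out.
surname_counts = patients["surname"].value_counts()

# The smallest weight; an implausibly low value hints at a units problem.
lowest_weight = patients["weight"].min()

# isnull().sum() counts only true missing values (NaN/None) per column.
null_counts = treatments[["auralin", "novodra"]].isnull().sum()

# A placeholder string is not null, so it has to be counted explicitly.
# ASSUMPTION: '-' is the placeholder used in this toy example.
dash_counts = (treatments[["auralin", "novodra"]] == "-").sum()

print(surname_counts)
print(lowest_weight)
print(null_counts)
print(dash_counts)
```

In this sketch `isnull().sum()` reports zero for both columns even though several doses are effectively missing, which is why a clean null count can itself be a warning sign.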

Solution

Quality Programmatic Assessment 2 Solution

*Note: while the default John Doe data is a validity issue, as described in the video, it is also a completeness issue because this default data displaced real patient data that is no longer in the `patients` table. Because completeness is more "severe" than validity, completeness is likely the more appropriate data quality dimension. The distinction is worth noting because missing data is usually best addressed first when cleaning data, as you'll experience in Lesson 4. However, let's assume that this overwritten data can't be recovered, which makes it acceptable to treat this as a validity issue.*

The option "'Elizabeth Knudsen' is a duplicated name" doesn't indicate a data quality issue because 'Elizabeth Knudsen' is not a duplicated name. Her demographic information, which is filled with NaN entries, is duplicated, though, since there are other patient records with missing address, city, state, etc. information.
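
To illustrate how missing values can show up as duplicates, here is a small sketch with invented data. The column names (`given_name`, `surname`, `address`) and the third patient are assumptions for illustration, not the notebook's actual contents. pandas treats NaN entries as equal to each other in `duplicated`, so a record with a missing address is flagged even when the name is unique.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are invented.
patients = pd.DataFrame({
    "given_name": ["Jake", "Jake", "Elizabeth", "Maria"],
    "surname": ["Jakobsen", "Jakobsen", "Knudsen", "Lopez"],
    "address": ["1 Example St", "1 Example St", np.nan, np.nan],
})

# Duplicates judged on the full name: only the second Jake Jakobsen row.
name_dups = patients.duplicated(subset=["given_name", "surname"])

# Duplicates judged on address: the second NaN is flagged as well,
# because duplicated() considers missing values equal to one another.
address_dups = patients["address"].duplicated()

# Ignoring rows whose address is missing leaves only the real repeat.
real_address_dups = address_dups & patients["address"].notnull()

print(name_dups.tolist())
print(address_dups.tolist())
print(real_address_dups.tolist())
```

When a row is flagged on a demographic column, it is worth checking whether the flagged value is a real repeat or just another missing entry.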

The indexes of the series returned by `sort_values` on the *weight* column of the *patients* table are supposed to be out of order, since the original dataset isn't sorted by weight.
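
A short sketch with invented weights shows why this is expected: `sort_values` reorders the rows but keeps each row's original index label, so the labels appear shuffled whenever the data was not already sorted.

```python
import pandas as pd

# Hypothetical weights in their original (unsorted) row order.
weights = pd.Series([180.0, 48.8, 150.2, 121.7], name="weight")

# Sorting reorders the values; each value keeps its original index label.
sorted_weights = weights.sort_values()

# Optional: discard the old labels if a fresh 0..n-1 index is wanted.
renumbered = sorted_weights.reset_index(drop=True)

print(sorted_weights)
print(renumbered)
```

Keeping the original labels is useful during assessment, since a label can be used to look up the full record for a suspicious value.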